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Abstract 

Many problems of low-level computer vision and image processing, such as denoising, de- 
convolution, tomographic reconstruction or super-resolution, can be addressed by maximizing 
the posterior distribution of a sparse linear model (SLM). We show how higher-order Bayesian 
decision-making problems, such as optimizing image acquisition in magnetic resonance scanners, 
can be addressed by querying the SLM posterior covariance, unrelated to the density's mode. We 
propose a scalable algorithmic framework, with which SLM posteriors over full, high-resolution 
images can be approximated for the first time, solving a variational optimization problem which 
is convex iff posterior mode finding is convex. These methods successfully drive the optimization 
of sampling trajectories for real-world magnetic resonance imaging through Bayesian experimen- 
tal design, which has not been attempted before. Our methodology provides new insight into 
similarities and differences between sparse reconstruction and approximate Bayesian inference, 
and has important implications for compressive sensing of real-world images. Parts of this work 
appeared at conferences ^33j Qfl\ ■ 



1 Introduction 

Natural images have a sparse low- level statistical signature, represented in the prior distribution 
of a sparse linear model (SLM). Imaging problems such as reconstruction, denoising or deconvolu- 
tion can successfully be solved by maximizing its posterior density (maximum a posteriori (MAP) 
estimation), a convex program for certain SLMs, for which efficient solvers are available. The suc- 
cess of these techniques in modern imaging practice has somewhat shrouded their limited scope 
as Bayesian techniques: of all the information in the posterior distribution, they make use of the 
density's mode only. 

Consider the problem of optimizing image acquisition, our major motivation in this work. Mag- 
netic resonance images are reconstructed from Fourier samples. With scan time proportional to 
the number of samples, a central question to ask is which sampling designs of minimum size still 
lead to MAP reconstructions of diagnostically useful image quality? This is not a direct recon- 
struction problem, the focus is on improving measurement designs to better support subsequent 
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reconstruction. Goal-directed acquisition optimization cannot sensibly be addressed by MAP es- 
timation, yet we show how to successfully drive it by Bayesian posterior information beyond the 
mode. Advanced decision-making of this kind needs uncertainty quantification (posterior covari- 
ance) rather than point estimation, requiring us to step out of the sparse reconstruction scenario 
and approximate sparse Bayesian inference instead. 

The Bayesian inference relaxation we focus on is not new [23l [T7] , yet when it comes to problem 
characterization or efficient algorithms, previous inference work lags far behind standards estab- 
lished for MAP reconstruction. Our contributions range from theoretical characterizations over 
novel scalable solvers to applications not previously attempted. The inference relaxation is shown 
to be a convex optimization problem if and only if this holds for MAP estimation (Section [3]), 
a property not previously established for this or any other SLM inference approximation. More- 
over, we develop novel scalable double loop algorithms to solve the variational problem orders of 
magnitude faster than previous methods we are aware of (Section |4]). These algorithms expose an 
important link between variational Bayesian inference and sparse MAP reconstruction, reducing 
the former to calling variants of the latter few times, interleaved by Gaussian covariance (or PCA) 
approximations (Section 4.4). By way of this reduction, the massive recent interest in MAP estima- 
tion can play a role for variational Bayesian inference just as well. To complement these similarities 
and clarify confusion in the literature, we discuss computational and statistical differences of sparse 
estimation and Bayesian inference in detail (Section [s]). 

The ultimate motivation for novel developments presented here is sequential Bayesian experimental 
design (Section [g]), applied to acquisition optimization for medical imaging. We present a powerful 
variant of adaptive compressive sensing, which succeeds on real-world image data where theoretical 
proposals for non-adaptive compressive sensing [HIIHIITO] fail (Section |6.1[ ). Among our experimental 
results is part of a first successful study for Bayesian sampling optimization of magnetic resonance 
imaging, learned and evaluated on real- world image data (Section |7.4[ ). 

An implementation of the algorithms presented here is publicly available, as part of the glm-ie 
toolbox (Section 4.6). It can be downloaded from ,niloss . org/ sof tware/view/269/, 



2 Sparse Bayesian Inference. Variational Approximations 

In a sparse linear model (SLM), the image u £ M" of n pixels is unknown, and m linear measure- 
ments y G are given, where m <C n in many situations of practical interest. 

y = Xu + e, sr^N{0,a^I), (1) 

where X € M™^" is the design matrix, and £ is Gaussian noise of variance u^, implying the 
Gaussian likelihood P{y\u) = N{y\Xu,a'^I). Natural images are characterized by histograms 
of simple filter responses (derivatives, wavelet coefficients) exhibiting super- Gaussian (or sparse) 
form: most coefficients are close to zero, while a small fraction have significant sizes \35\ [28] (a 



precise definition of super-Gaussianity is given in Section 2.1). Accordingly, SLMs have super- 
Gaussian prior distributions P{u). The MAP estimator Wmap •= argmax.^ [log P(2/|u) +logP(it)] 
can outperform maximum likelihood Uj^i^ := argmax^ log P{y\u), when u represents an image. 

In this paper, we focus on priors of the form P{u) oc Ylj=iti{si), where s = Bu. The potential 
functions ti{-) are positive and bounded. The operator B G R'^^" may contain derivative filters or a 
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wavelet transform. Both X and B have to be structured or sparse, in order for any SLM algorithm 
to be scalable. Laplace (or double exponential) potentials are sparsity-enforcing: 

ti(si) = e-^'l^»l, n>0. (2) 

For this particular prior and the Gaussian likelihood ([T]), MAP estimation corresponds to a 
quadratic program, known as LASSO [37] for B = I. Note that logti{si) is concave. In gen- 
eral, if log-concavity holds for all model potentials, MAP estimation is a convex problem. Another 
example for sparsity potentials are Student's t: 

tM) = (1 + (r./i^)s2)-(-+i)/2, r„ > 0. (3) 

For these, \ogti{si) is not concave, and MAP estimation is not (in general) a convex program. Note 
that — logij(sj) is also known as Lorentzian penalty function. 

2.1 Variational Lower Bounds 

Bayesian inference amounts to computing moments of the posterior distribution 
P{u\y) = Z~^N{y\Xu, a^I) \^,^ ti{si), s = Bu, 
Z= I N{y\Xu,a^I)Yl^^^t,{s,)du. 

This is not analytically tractable in general for sparse linear models, due to two reasons coming 
together: P(u\y) is highly coupled {X is not blockdiagonal) and non-Gaussian. We focus on vari- 
ational approximations here, rooted in statistical physics. The log partition function log Z (also 
known as log evidence or log marginal likelihood) is the prime target for variational methods |42j . 
Formally, the potentials ti{si) are replaced by Gaussian terms of parametrized width, the posterior 
P{u\y) by a Gaussian approximation Q{u\y). The width parameters are adjusted by fitting Q(u\y) 
to P{u\y), in what amounts to the variational optimization problem. 




Figure 1: Super-Gaussian distributions admit tight Gaussian- form lower bounds of any width 7. 
Left: Laplace ([2]); middle: Bernoulli ([6|; right: Student's t ([s]). 



3 



Our variational relaxation exploits the fact that all potentials ti{si) are (strongly) super- Gaussian: 
there exists a bi G U. such that gi{si) := logij(si) — biSi is even, and convex and decreasing as a 
function of Xi := sf [23]. We write gi{xi) := Xj > in the sequel. This implies that 

U{s,) = maxe''»^'-^?/(2^«)-'^»(^>)/^ hi{^,) := max(-Xi/7i - 2g,{xi)). (5) 

7i>0 Xi>0 

A super-Gaussian ti{si) has tight Gaussian-form lower bounds of all widths (see Figure [l]). We 
replace ti{si) by these lower bounds in order to step from P{u\y) to the family of approximations 
Qiu\y) oc P(y|^/)e^^^-i^'^('^'^s^)"'^ where 7 = (7,). 

To establish ([s]), note that the extended-value function gi{xi) (assigning gi{xi) = 00 for Xi < 0) 
is a closed proper convex function, thus can be represented by Fenchel duality [26l Sect. 12]: 
gi{xi) = sup;^^ XjAj — g* (Xi) , whcrc the conjugate function g*{\i) = sup^^ XjAj — gi{xi) is closed 
convex as well. For Xj > and Xi > 0, we have that XjAj — gi{xi) > XiXi — gi{0) — 00 with Xi — >• 00, 
which implies that g*{Xi) = 00 for Aj > 0. Also, g*{jS) = —\\mxi-^oo gi{xi) , so that — 5*(0) < gi{xi) 
for any Xi < 00. Therefore, for any Xi G [0,cxd), gi{xi) = sup;^^<o XjAj — g*{Xi). Reparameterizing 
7i := — l/(2Aj), we have that gi{xi) = max^.>o — 2;j/(27j) — g*{—l/ {2^i)) (note that in order to 
accommodate 5j(0) in general, we have to allow for 7^ = 0). Finally, hii^i) := 2g* {—1 / (Iji)) . 

The class of super-Gaussian potentials is large. All scale mixtures (mixtures of zero-mean Gaussians 
Uisi) = E^jA^(sj|0, 7j)]) are super-Gaussian, and /ij(7i) can be written in terms of the density for 
7i [23]. Both Laplace ^ and Student's t potentials ([s]) are super-Gaussian, with /ij(7i) given in 



Appendix A. 6 Bernoulli potentials, used as binary classification likelihoods, 

U{s,) = (1 + e-y^-'^^y^ , Vi G {±1}, Ti>0 (6) 

are super-Gaussian, with bi = yiTi/2 Sect. 3.B]. While the corresponding hi{'yi) cannot be 
determined analytically, this is not required in our algorithms, which can be implemented based 
on gi{xi) and its derivatives only. Finally, if the tf\si) are super-Gaussian, so is the positive 

mixture Ylii=i^i^f\^i)-: > 0> because the logsumexp function x 1— log 1-^ exp(a;) fl] Sect. 3.1.5] 
is strictly convex on MJ" and increasing in each argument, and the log- mixture is the concatenation 
of logsumexp with (logtf^(sj) + log a/);, the latter convex and decreasing for Xj = > in each 
component [U Sect. 3.2.4]. 

A natural criterion for fitting Q{u\y) to P{u\y) is obtained by plugging these bounds into the 
partition function Z of Q: 



I N{y\Xu,a^I)e'''''-"2^^^-''du, h{j) := ^^'^/^(t^), (7) 



where T := diag7 and s = Bu. The right hand side is a Gaussian integral and can be evaluated 
easily. The variational problem is to maximize this lower bound w.r.t. the variational parameters 
7 ^ (7j > for all i = 1, . . . with the aim of tightening the approximation to logZ. This 
criterion can be interpreted as a divergence function: if the family of all Q{u\y) contained the true 
posterior P{u\y), the latter would maximize the bound. 

This relaxation has frequently been used before [11^ [23l [T7] on inference problems of moderate size. 
In the following, we provide results that extend its scope to large scale imaging problems of interest 
here. In the next section, we characterize the convexity of the underlying optimization problem 
precisely. In Section |4j we provide a new class of algorithms for solving this problem orders of 
magnitude faster than previously used techniques. 



4 



3 Convexity Properties of Variational Inference 



In this section, we characterize the variational optimization problem of maximizing the right hand 
side of ([7]). We show that it is a convex problem if and only if all potentials tj(sj) are log-concave, 
which is equivalent to MAP estimation being convex for the same model. 

We start by converting the lower bound ([T]) into a more amenable form. The Gaussian posterior 
Q{u\y) has the covariance matrix 

CovQ[it|y] = A-\ A := a'^X^X + B^T^B, T = diag7. (8) 

We have that 

J P(2/|u)e^^^-5«'^r-is ^ |27rA-V/VaxP(y|u)e^^^-5«^r-is^ ^ ^ 

since max^, Q{u\y) = \2ttA-'^\-^/'^ J Q{u\y) du. We find that Z > Cie-<^(^)/2, where 

(/)(7) :=log|A| +/i(7) + mini?(-u,7), R := a^'^Wy - Xuf + s'^T^'^s - 2b'^ s (9) 

u 

and Ci = (27r)'^"'~™'^/^(T~™'. The variational problem is min^^o0(7), and the Gaussian posterior 
approximation is Q{u\y) with the final parameters 7 plugged in. We will also write (j){u,^) := 
log \ A\ + h{'y) + R{u,'~f), so that 0(7) = min^ (j){u,^). 

It is instructive to compare this variational inference problem with maximum a posteriori (MAP) 
estimation: 

min —2 log P{u\y) = minu^^Hy — -X^w|P — 2 / logtj(si) + C2 

' (10) 

= min h{'y) + R{u,-f) + C2, 

where C2 is a constant. The difference between these problems rests on the log \A\ term, present in 
0(7) yet absent in MAP. Ultimately, this observation is the key to the characterization provided in 
this section and to the scalable solvers developed in the subsequent section. Its full relevance will 
be clarified in Section [5l 

3.1 Convexity Results 

In this section, we prove that (/>(7) is convex if all potentials tj(sj) are log-concave, with this 
condition being necessary in general. We address each term in ^ separately. 

Theorem 1 Let X G M™^", B G M"?^" be arbitrary matrices, and 

A{d) := a''^X^X + B^{d[agd)B, dyO, 
so that A{d) is positive definite for all d >- 0. 

(1) Let fii'ji) be twice continuously dijjerentiable functions into M+, so that log/j(7j) are convex 
for all i and ji. Then, 7 1— >■ log | A(/(7))| is convex. Especially, 7 1— >• log \A\ is convex. 
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(2) Let fi^TTi) be concave functions into M-|_. Then, tv i— )• log | A(/(7r))| is concave. Especially, 

I— )• log \A\ is concave. 

(3) Let fi[^i) be concave functions intoM^. Then, 7 1— )• l'^(log/(7))+log |A(/(7)~^)| is concave. 
Especially, 7 1— l"^(log7) + log |A| is concave. 

(4) Let Q{u\y) be the approximate posterior with covariance matrix given by Then, for all i: 

\a.TQ[si\y] = 5jBA-^B^5, < 7^. 



A proof is provided in Appendix A.l Instantiating part M with /i(7j) = 7^ ^, we see that 7 1— 
log|A| is convex. Other valid examples are fi{^i) = 7^ , ft > 0. For /i(7j) = e'^\ we obtain the 
convexity of 7 1— log | A(exp(7))|, generalizing the logsumexp function to matrix values. Part Q 
and part ^ will be required in Section 4.2 Finally, part Q gives a precise characterization of 7^ 



as sparsity parameter, regulating the variance of Sj. 

Theorem 2 The function 

7 H> 0(7) - hH) = log lAI + min ia^'^Wy - XwlP + s^T~^s - 2b^s) 

is convex for 'y y 0, where s = Bu. 



Proof. The convexity of log |A| has been shown in Theorem a~'^\\y — XuW^ — 2b'^ s is convex in 
u, and (w, 7) 1— s- s^V^^s is jointly convex, since the quadratic-over-linear function {si,'^i) 1— )• sf/^i is 
jointly convex for 7i > [H Sect. 3.1.5]. Therefore, miuit R{u, 7) is convex for 7 ;^ [H Sect. 3.2.5]. 
This completes the proof. 

To put this result into context, note that 

0(7) - /i(7) = -2 log j P(y|«)e^^^-5«^r-is s = Bu 

is the negative log partition function of a Gaussian with natural parameters 7~^: it is well known 
that 7~^ I—)- 0(7) — h{'~f) is a concave function [l2]. However, 7"^ 1— )• /i(7) is convex for a model 
with super-Gaussian potentials (recall that hi{'yi) = 2g*{—l/{2^i)), where g*{-) is convex as dual 
function of gi{-)), which means that in general 7"^ 1— t- (j){'~f) need not be convex or concave. The 
convexity of this negative log partition function w.r.t. 7 seems specific to the Gaussian case. 

Given Theorem [2| if all hi{-fi) are convex, the whole variational problem min-y^o (/) is convex. With 
the following theorem, we characterize this case precisely. 

Theorem 3 Consider a model with Gaussian likelihood ^ and a prior P{u) oc '\Xi=i'ti{si) , s = 

Bu, so that all ti{si) are strongly super- Gaussian, meaning that gi{si) = logtj(sj) — biSi is even, 
1/2 

and gi{xi) = gi{x^ ) is strictly convex and decreasing for > 0. 

(1) If gi{si) is concave and twice continuously differentiable for Si > 0, then hi{^i) is convex. On 
the other hand, if g'l{si) > for some Sj > 0, then hi{^i) is not convex at some 7^ > 0. 

(2) If all gi{si) are concave and twice continuously differentiable for Si > 0, then the variational 
problem min-y^o 4> is a convex optimization problem. On the other hand, ifg'l{si) > for some 
i and Si > 0, then h{-y) is not convex, and there exist some X, B, and y such that ^(7) is 
not a convex function. 
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The proof is given in Appendix A. 2 Our theorem provides a complete characterization of convexity 
for the variational inference relaxation of Section [2| which is the same as for MAP estimation. Log- 
concavity of all potentials is sufficient, and necessary in general, for the convexity of either. We are 
not aware of a comparable equivalence having been established for any other nontrivial approximate 
inference method for continuous variable models. 



We close this section with some examples. For Laplacians ([2]), hiiji) = rf^i (see Appendix A. 6). 
For SLMs with these potentials, MAP estimation is a convex quadratic program. Our result implies 
that variational inference is a convex problem as well, albeit with a differentiable criterion. Bernoulli 
potentials ([g]) are log-concave. MAP estimation for generalized linear models with these potentials 
is known as penalized logistic regression, a convex problem typically solved by the iteratively 
reweighted least squares (IRLS) algorithm. Variational inference for this model is also a convex 
problem, and our algorithms introduced in Section |4] make use of IRLS as well. Finally, Student's 
t potentials (|3]) are not log-concave, and hi{^i) is neither convex nor concave (see Appendix A. 6). 
Neither MAP estimation nor variational inference are convex in this case. 

Convexity of an algorithm is desirable for many reasons. No restarting is needed to avoid local 
minima. Typically, the result is robust to small perturbations of the data. These stability properties 
become all the more important in the context of sequential experimental design (see Section [6|, 
or when Bayesian model selectiorj^ is used. However, the convexity of (/'(t) does not necessarily 
imply that the minimum point can be found efficiently. In the next section, we propose a class of 
algorithms that solve the variational problem for very large instances, by decoupling the criterion 
Q in a novel way. 



4 Scalable Inference Algorithms 

In this section, we propose novel algorithms for solving the variational inference problem min-^, (/> in 
a scalable way. Our algorithms can be used whether (;/)(7) is convex or not, they are guaranteed to 
converge to a stationary point. All efforts are reduced to well known, scalable algorithms of signal 
reconstruction and numerical mathematics, with little extra technology required, and no additional 
heuristics or step size parameters to be tuned. 

We begin with the special case of log-concave potentials tj(sj), such as Laplace ([2]) or Bernoulli 
(|6]), extending our framework to full generality in Section 4.1 The variational inference problem 



is convex in this case (Theorem [s]) . Previous algorithms for solving min-y (/>(')') |1H [23] are of the 
coordinate descent type, minimizing (p w.r.t. one 7j at a time. Unfortunately, such algorithms cannot 
be scaled up to imaging problems of interest here. An update of ji depends on the marginal posterior 
Q{si\y), whose computation requires the solution of a linear system with matrix A G M"^". At the 
projected scale, neither A nor a decomposition thereof can be maintained, systems have to be solved 
iteratively. Now, each of the q potentials has to be visited at least once, typically several times. 
With q, n, and m in the hundred thousands, it is certainly infeasible to solve 0{q) linear systems. In 
contrast, the algorithms we develop here often converge after less than hundred systems have been 
solved. We could also feed (p{'y) and its gradient V-y0 into an off-the-shelf gradient-based optimizer. 
However, as already noted in Sectionjsj </>(7) is the sum of a standard penalized least squares (MAP) 
part and a highly coupled, computationally difficult term. The algorithms we propose take account 
of this decomposition, decoupling the troublesome term in inner loop standard form problems which 



^ Model selection (or hyperparameter learning) is not discussed in this paper. It can be implemented easily by 
maximizing the lower bound —(j){f)/2 + logCi < logZ w.r.t. hyperparameters. 
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can be solved by any of a large number of specialized algorithms not applicable to min-y The 
expensive part of 'SJ^cj) has to be computed only a few times for our algorithms to converge. 

We make use of a powerful idea known as double loop or concave-convex algorithms. Special cases 
of such algorithms are frequently used in machine learning, computer vision, and statistics: the 
expectation-maximization (EM) algorithm [8], variational mean field Bayesian inference [1], or 
CCCP for discrete approximate inference |48| . among many others. The idea is to tangentially 
upper bound (j) by decoupled functions which are much simpler to minimize than (p itself: algo- 
rithms iterate between refitting (pz to (j) and minimizing (pz- For example, in the EM algorithm for 
maximizing a log marginal likelihood, these stages correspond to "E step" and "M step" : while the 
criterion could well be minimized directly (at the expense of one "E step" per criterion evaluation), 
"M step" minimizations are much easier to do. 

As noted in Section [sj if the variational criterion Q lacked the log|A| part, it would correspond 
to a penalized least squares MAP objective (10), and simple efficient algorithms would apply. 

evaluating log|A| or its gradient are computationally challenging. 



As discussed in Section 4.4 



Crucially, this term satisfies a concavity property. As shown in Section 4.2 Fenchel duality implies 
that log \ A\ < zf {-y'^) — gl{zi) . For any fixed zi >- 0, the upper bound is tangentially tight, convex 
in 7, and decouples additively. If log|A| is replaced by this upper bound, the resulting objective 
02^(^,7) := zf{'y~^) + /i(7) + R{u,'-f) — gl{zi) is of the same decoupled penalized least squares 



form than a MAP criterion (10). This decomposition suggests a double loop algorithm for solving 
min-y(/)(7). In inner loop minimizations, we solve minu^-yyo (j)zi for fixed zi >- 0, and in interjacent 
outer loop updates, we refit zi ^ argmin i?!)^^ (u, 7). 

The MAP estimation objective (10) and (pziiu,-/) have a similar form. Specifically, recall that 

1 /2 

-2gi{xi) = miny.>oXi/7i + hi{-ji), where gi{xi) = gi{x-' ) and gi{si) = logti(si) - ftjSj. The inner 
loop problem is 



min (j)zi{u,'-f) 



mmo" 



\y 



Xu\ 



[gi{ziA + Si 



+ biSi) , 



(11) 



where s = Bu. This is a smoothed version of the MAP estimation problem, which would be 
obtained for zi^i = 0. However, zi^i > in our approximate inference algorithm at all times (see 
Section 4.2). Upon inner loop convergence to tx*, 7=1,^4 = —l/[2{dgi/dxi)\^.^^_^__^_g2 ], where = 
BUi:. Note that in order to run the algorithm, the analytic form of /ij(7i) need not be known. For 

and 7*,j = 



Laplace potentials (2), the inner loop penalizer is ^Y^^TiJ zi^i + s. 



Zl,i + sljTi. 



Importantly, the inner loop problem (11) is of the same simple penalized least squares form than 



MAP estimation, and any of the wide range of recent efficient solvers can be plugged into our 
method. For example, the iteratively reweighted least squares (IRLS) algorithm [14J, a variant of 



the Newton-Raphson method, can be used (details are given in Section 4.3). Each Newton step 
requires the solution of a linear system with a matrix of the same form as A ([s]), the convergence 
rate of IRLS is quadratic. It follows from the derivation of ^ that once an inner loop has converged 
to (ii*,7*), the minimizer u^: is the mean of the approximate posterior Q{u\y) for 7=,,. 

The rationale behind our algorithms lies in decoupling the variational criterion (j) via a Fenchel dual- 
ity upper bound, thereby matching algorithmic scheduling to the computational complexity struc- 
ture of (j). To appreciate this point, note that in an off-the-shelf optimizer applied to min-^^o 
both 0(7) and the gradient V-yCp have to be computed frequently. In this respect, the log \A\ cou- 
pling term proves by far more computationally challenging than the rest (see Section 4.4). This 



obvious computational difference between parts of </>(7) is not exploited in standard gradient based 
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algorithms: they require all of ^-ycj) in each iteration, all of fj^il) in every single line search step. 
As discussed in Section 4.4, computing log | A| to high accuracy is not feasible for models of inter- 
est here, and most off-the-shelf optimizers with fast convergence rates are very hard to run with 
such approximately computed criteria. In our algorithm, the critical part is recognized and decou- 
pled, resulting inner loop problems can be solved by robust and efficient standard code, requiring 
a minimal effort of adaptation. Only at the start of each outer loop step, (j)z^ has to be refitted: 
zi <— V^-ilog|A| (see Section 4.2), the computationally critical part of V-yc/) is required there. 



Fenchel duality bounding is used to minimize the number of these costly steps (further advantages 
are noted at the end of Section 4.4). Resulting double loop algorithms are simple to implement 
based on efficient penalized least squares reconstruction code, taking full advantage of the very 
well-researched state of the art for this setup. 



4.1 The General Case 

In this section, we generalize the double loop algorithm along two directions. First, if potentials 
logtj(sj) are not log-concave, the inner loop problems (11) are not convex in general (Theorem [3|, 



yet a simple variant can be used to remedy this defect. Second, as detailed in Section [4.2[ there 
are different ways of decoupling log|A|, giving rise to different algorithms. In this section, we 
concentrate on developing these variants, their practical differences and implications thereof are 
elaborated on in Section [5l 

If ti{si) is not log-concave, then hi{^i) is not convex in general (Theorem [3]) . In this case, we can 
write hi{'~fi) = hn^ii^ji) + h\j,i{'yi), where /in,i is concave and nondecreasing, /iu,j is convex. Such a 
decomposition is not unique, and has to be chosen for each hi at hand. With hindsight, /in,j should be 
chosen as small as possible (for example, /in,i = if ti{si) is log-concave, the case treated above), and 



if IRLS is to be used for inner loop minimizations (see Section 4.3), /iu,j should be twice continuously 
differentiable. For Student's t potentials ([s]), such a decomposition is given in Appendix A. 6, We 
define /in(7) = X^i^n,i(7j)) ^u(7) = X^i^u,j(7j)i and modify outer loop updates by applying a 
second Fenchel duality bounding operation: /in (7) ^ ^2^7 ~ 52(^2), resulting in a variant of the 



inner loop criterion (11). If /in.i is differentiable, the outer loop update is Z2 h'^iili), otherwise 



any element from the subgradient can be chosen (note that 22 > 0, as /in,i is nondecreasing). 



Moreover, as shown in Section 4.2, Fenchel duality can be employed in order to bound log \A\ in 
two different ways, one employed above, the other being log | A| < z^^ — 1^ {log'^) — g2{z2) , Z2 ^ 0. 
Combining these bounds (by adding Z2 to 2:2)) we obtain 

0(7, w) < <Pz{u,l) := zj{-f-^) + zl-f - zl {\og^) + hyj{-i) + R{u,-() - g*{z), 

where z^^i E {0,1}, and g*{z) collects the offsets of all Fenchel duality bounds. Note that Zj ^ 
for j = 1, 2, 3, and for each i, either zi^i > and zs^i = 0, or zi^i = and 22,4 > 0, z^^i = 1. We have 
that 

(l)z(.u) := mm(l)z{u,-f) = a''^\\y - Xuf + 2^^"^ h*{si) - 2b^s, 

1 , (12) 

h*iisi) ■= 7, mm + ■5i)/7i + ^2,i7i - 23,ilog7j + hu^i{'yi), s = Bu. 

I 7i>0 

^ Note that Fenchel duahty bounding is also used in difference-of-convex programming, a general framework to 
address non-convex, typically non-smooth optimization problems in a double loop fashion. In our application, (^(7) 
is smooth in general and convex in many applications (see Section [sjl: our reasons for applying bound minimization 
are different. 
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Note that h*{si) is convex as minimum (over 7^ > 0) of a jointly convex argument [H Sect. 3.2.5]. 
The inner loop minimization problem min^ (min-,, (j)z ) is of penalized least squares form and can be 



solved with the same array of efficient algorithms applicable to the special case (11). An application 
of the second order IRLS method is detailed in Section lXsl A schema for the full variational inference 
algorithm is given in Algorithm [T] 

Algorithm 1 Double loop variational inference algorithm 
repeat 

if first outer loop iteration then 

Initialize bound (pz- u = 0. 
else 

Outer loop update: Refit upper bound (t)z to 4> (tangent at 7). 



Requires marginal variances z = diag-^SA-i^^) (Section |4.4[). 
Initialize u = (previous solution), 
end if 
repeat 



Newton (IRLS) iteration to minimize min-y)-o 0z (12) w.r.t. u. 

Entails solving a linear system (by LCG) and line search (Section |4.3[ ). 
until = argmin„ (min-),^o 'Pz) converged 
Update 7 = argmin^^o <t'z{u*, •)• 
until outer loop converged 

The algorithms are specialized to the ti{si) through h*{si) and its derivatives. The important special 
case of log-concave ti(sj) has been detailed above. For Student's t potentials ([S]), a decomposition 



is detailed in Appendix A. 6 In this case, the overall problem min-),;-o 4>{l) is not convex, yet our 



double loop algorithm iterates over standard-form convex inner loop problems. Finally, for log- 



concave ti{si) and Z2,i / (type B bounding, Section 4.2), our algorithm can be implemented 
generically as detailed in Appendix |A.5 



We close this section establishing some characteristics of these algorithms. First, we found it useful 
to initialize them with constant zi and/or Z2 of small size, and with u = 0. Moreover, each 
subsequent inner loop minimization is started with u = from the last round. The development 
of our algorithms is inspired by the sparse estimation method of [43j . relationships to which are 
discussed in Section [5j Our algorithms are globally convergent, a stationary point of 0(7) is found 
from any starting point 7 ;^ (recall from Theoremjsjthat for log-concave potentials, this stationary 
point is a global solution). This is seen as detailed in [33]. Intuitively, at the beginning of each outer 
loop iteration, (j)z and (j) have the same tangent plane at 7, so that each inner loop minimization 
decreases (j) significantly unless V-y(/) = 0. Note that this convergence proof requires that outer loop 



updates are done exactly, this point is elaborated on at the end of Section |4.4 

Our variational inference algorithms differ from previous method^ in that orders of magnitude 
larger models can successfully be addressed. They apply to the particular variational relaxation 
introduced in Section [sj whose relationship to other inference approximations is detailed in [29] . 
While most previous relaxations attain scalability through many factorization assumptions con- 
cerning the approximate posterior, Q{u\y) in our method is fully coupled, sharing its conditional 
independence graph with the true posterior P{u\y). A high-level view on our algorithms, discussed 



^ This comment holds for approximate inference methods. For sparse estimation, large scale algorithms are available 
(see Section [5]). 



10 



in Section 4.4 , is that we replace a priori independence (factorization) assumptions by less damaging 
low rank approximations, tuned at runtime to the posterior shape. 



4.2 Bounding log I A| 



We need to upper bound log \A\ by a term which is convex and decoupling in 7. This can be done 
in two different ways using Fenchel duality, giving rise to bounds with different characteristics. 



Details for the development here are given in Appendix A. 4 

Recall our assumption that A >- for each 7 ;^ 0. If tt = 7"^, then tt i— )• log|A(7r)| = log|A| is 
concave for iv y (Theorem [l]j2| with / = id). Moreover, log|A(7r)| is increasing and unbounded 
in each component of tt (Theorem [4|. Fenchel duality |26| Sect. 12] implies that log|A(7r)| = 
raiiiz^^o zj TV — gl{zi) for tv )^ 0, thus log|A| = miuz^^o zj {-y'^) — glizi) for 7^0. Therefore, 
log \ A\ < 2^(7^^) — gl{zi). For fixed 7^0, this is an equality for 



Zl,* 



V^-i log|A| 



z := (VarQ[s,|y]) = diag^^ {BA-^B^) >~ 0, 



and gl{zi^^:) = zf^i^j ) — log \ A\. This is called bounding type A in the sequel. 

On the other hand, 7 1— )• l"^(log7) + log|A| is concave for 7 ;^ (Theorem [ijjs]) with / = id). 
Employing Fenchel duality once more, we have that log \ A\ < zj^— l"'"(log7) — g2{z2), Z2 ^ 0. For 
any fixed 7, equality is attained at 22,* = 7~^o(l— 7 ""^oz), and 52 (-^2,*) = 2;J^7— log | A| — l-^(log7) 
at this point. This is referred to as bounding type B. 




Figure 2: Comparison of type A and B upper bounds on log(l + 2/7). 

In general, type A bounding is tighter for 7j away from zero, while type B bounding is tighter for ji 
close to zero (see Figure[2]), implications of this point are discussed in Sectionjsj Whatever bounding 
type we use, refitting the corresponding upper bound to log \A\ requires the computation of £ = 
(VarQ[si\y]): all marginal variances of the Gaussian distribution Q{s\y). In general, computing 
Gaussian marginal variances is a hard numerical problem, which is discussed in more detail in 
Section 1131 
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4.3 The Inner Loop Optimization 



The inner loop minimization problem is given by ( |12[ ), its special case ( |11[ ) for log-concave potentials 
and log|A| bounding type A is given by h*{si) = —gi{zi^i + sf). This problem is of standard 
penalized least squares form, and a large number of recent algorithms [12| El US] can be applied 
with little customization efforts. In this section, we provide details about how to apply the iteratively 
reweighted least squares (IRLS) algorithm [14j, a special case of the Newton- Raphson method. 

We describe a single IRLS step here, starting from u. Let r := Xu — y denote the residual vector. 
If 9^ := {h*y{si) - h, Pi := {h*)"{s^), then 

V^(0^/2) = cj-^X'^r + B^e, VVui<Pz/2) = a'^X^X + S^(diagp)B. 

Note that /Oj > by the convexity of h*. The Newton search direction is 

d := -{a-^X^X + B^{diagp)B)-\a-^X^r + B'^e). 

The computation of d requires us to solve a system with a matrix of the same form as A, a 
reweighted least squares problem otherwise used to compute the means in a Gaussian model of 
the structure of Q{u\y). We solve these systems approximately by the (preconditioned) linear 
conjugate gradients (LOG) algorithm [13j. The cost per iteration of LCG is dominated by matrix- 
vector multiplications (MVMs) with X^X, B, and B^ . A line search along d can be run in 
negligible time. If f{t) := (j)^{u + td)/2, then f'{t) = a-^{{Xdfr + t\\Xd\\^) + {Bd)'^0'^^\ where 
is the gradient at = s + tBd. With {Xd)'^r, || JCdp, and Bd precomputed, f{t) and 
f'{t) can be evaluated in 0{q) without any further MVMs. The line search is started with to = 1. 
Finally, once = argmin^ <t>z{u) is found, 7 is explicitly updated as argmin c;/)^ (u* , •). Note that 
at this point, = EQ[ti|y], which follows from the derivation at the beginning of Section Isj 



4.4 Estimation of Gaussian Variances 



Variational inference does require marginal variances z = diag "^{BA ^B^) = (VaTQ[si\y]) of the 



Gaussian Q{s\y) (see Section 4.2), which are much harder to approximate than means. In this 
section, we discuss a general method for (co)variance approximation. Empirically, the performance 
of our double loop algorithms is remarkably robust in the light of substantial overall variance 
approximation errors, some insights into this finding are given below. 

Marginal posterior variances have to be computed in any approximate Bayesian inference method, 
while they are not required in typical sparse point estimation techniques (see Section [s]) . Our 
double loop algorithms reduce approximate inference to point estimation and Gaussian (co)variance 
approximation. Not only do they expose the latter as missing link between sparse estimation and 
variational inference, their main rationale is that Gaussian variances have to be computed a few 
times only, while off-the-shelf variational optimizers query them for every single criterion evaluation. 

Marginal variance approximations have been proposed for sparsely connected Gaussian Markov 
random fields (MRFs), iterating over embedded spanning tree models [JT] or exploiting rapid cor- 
relations decay in models with homogeneous prior [20j. In applications of interest here, A is neither 
sparse nor has useful graphical model structure. Committing to a low-rank approximation of the 
covariance A~^ \20\ I27j. an optimal choice in terms of preserving variances is principal components 
analysis (PGA), based on the smallest eigenvalues/- vectors of A (the largest of A~^). The Lanczos 
algorithm [13] provides a scalable approximation to PGA and was employed for variance estimation 
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in [27]. After k iterations, we have an orthonormal basis S M"^'^, within which extremal eigen- 
vectors of A are rapidly well approximated (due to the nearly linear spectral decay of typical A ma- 
trices (Figurejoj upper panel), both largest and smallest eigenvalues are obtained). As Qj^AQk = 
is tridiagonal, the Lanczos variance approximation = diag~^ {BQkTjT^Q^^B^) can be computed 
efficiently. Importantly, Zk,i < z^+i^i < Zi for all k and i. Namely, if Qk = [qi ■ ■ ■ Qk], and has 
main diagonal (a;), subdiagonal {(3i), let (e^) and (d/) be the main and subdiagonal of the bidi- 
agonal Cholesky factor Lk of T^. Then, d^-i = f3k-i/ek^i, = (ok — d1_-^)^/'^, with do = 0. If 
Vk := BQkL~^, we have Vk = [vi . ..Vk], Vk = {Bq^ - 4-i^'fc-i)/efc. Finally, Zfc = + vl 
(with io = 0). 

Unfortunately, the Lanczos algorithm is much harder to run in practice than LCG, and its cost grows 
superlinearly in /c. A promising variant of selectively reorthogonalized Lanczos p3] is given in [2], 
where contributions from undesired parts of the spectrum (A's largest eigenvalues in our case) are 
filtered out by replacing A with polynomials of itself. Recently, randomized PCA approaches have 
become popular [15], although their relevance for variance approximation is unclear. Nevertheless, 
for large scale problems of interest, standard Lanczos can be run for k <^n iterations only, at which 



point most of the ai'e severely underestimated (see Section 7.3). Since Gaussian variances are 
essential for variational Bayesian inference, yet scalable, uniformly accurate variance estimators are 
not known, robustness to variance approximations errors is critical for any large scale approximate 
inference algorithm. 

What do the Lanczos variance approximation errors imply for our double loop algorithms? First, 



the global convergence proof of Section 4.1 requires exact variances z, thus may be compromised if 
Zfc is used instead. This problem is analyzed in [31j: the convergence proof remains valid with the 
PCA approximation, which however is different from the Lanczo^ approximation. Empirically, we 
did not encounter convergence problems so far. 

Surprisingly, while zj. is much smaller than z in practice, there is little indication of substantial 



negative impact on performance. This important robustness property is analyzed in Section 7.3 
for an SLM with Laplace potentials. The underestimation bias has systematic structure (Figure |6| 
middle and lower panel): moderately small Zi are damped most strongly, while large Zi are ap- 
proximated accurately. This happens because the largest coefficients Zi depend most strongly on 
the largest covariance eigenvectors, which are shaped in early Lanczos iterations. This particular 
error structure has statistical consequences. Recalling the inner loop penalty for Laplacians ([2]): 
h*{si) = Ti{zi + sfY^"^, the smaller Zi, the stronger it enforces sparsity. If Zi is underestimated, 
the penalty on Si is stronger than intended, yet this strengthening does not happen uniformly. 
Coefficients Si deemed most relevant with exact variance computation (largest Zi) are least affected 
(as Zk,i ~ Zi for those), while already subdominant ones (smaller Zi) are suppressed even stronger 
(as Zk^i <C Zi). At least in our experience so far (with sparse linear models), this selective variance 
underestimation effect seems benign or even somewhat beneficial. 



4.5 Extension to Group Potentials 

There is substantial recent interest in methods incorporating sparse group penalization., meaning 
that a number of latent coefficients (such as the column of a matrix, or the incoming edge weights 
for a graph) are penalized jointly [47^ I40j . Our algorithms are easily generalized to models with 

While Lanczos can be used to compute the PCA approximation (fixed number L of smaUest eigenvalues/-vectors 
oi A), this is rather wasteful. 
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potentials of the form tj(||sj||), Sj a subvector of s, || • || the Euchdean norm, if ti{-) is even and 
super-Gaussian. Such group potentials are frequently used in practice. The isotropic total variation 
penalty is the sum of dxi, dyi differences along different axes, which corresponds to 

group Laplace potentials. In our MRI application (Section |7.4[ ), we deal with complex-valued u and 
s. Each entry is treated as element in M'^, and potentials are placed on \si\ = 9sj)||. Note 

that with ti on the single parameter ji is shared by the coefficients Sj. 

The generalization of our algorithms to group potentials is almost automatic. For example, if all Sj 
have the same dimensionality, is replaced by (g) / in the definition of A, and z is replaced 



by (/eg) 1"^) diag ^{BA ^B'^) in Section 4.2, Moreover, Xi = sf is replaced by Xi = whereas 



the definition of gi{xi) remains the same. Apart from these simple replacements, only IRLS inner 



loop iterations have to be modified (at no extra cost), as is detailed in Appendix A. 7 



4.6 Publicly Available Code: The glm-ie Toolbox 

Algorithms and techniques presented in this paper are implementecj^as part of the generalized linear 
model inference and estimation toolbox (glm-ie), maintained as mloss . org project at mloss . org/ 
sof tware/view/269/, The code runs with both Matlab 7 and the free Octave 3.2. It comprises 
algorithms for MAP (penalized least squares) estimation and variational inference in generalized 
linear models (Section |4]), along with Lanczos code for Gaussian variances (Section 4.4). 



Its generic design allows for a range of applications, as illustrated by a number of example programs 
included in the package. Many super-Gaussian potentials ti{si) are included, others can easily be 
added by the user. In particular, the toolbox contains a range of solvers for MAP and inner loop 



problems, from IRLS (or truncated Newton, see Section 4.3) over conjugate gradients to Quasi- 



Newton, as well as a range of commonly used operators to construct X and B matrices. 



5 Sparse Estimation and Sparse Bayesian Inference 

In this section, we contrast approximate Bayesian inference with point estimation for sparse linear 
models (SLMs): sparse Bayesian inference versus sparse estimation. These problem classes serve 
distinct goals and come with different algorithmic characteristics, yet are frequently confused in 
the literature. Briefly, the goal in sparse estimation is to eliminate variables not needed for the 
task at hand, while sparse inference aims at quantifying uncertainty in decisions and dependencies 
between components. While variable elimination is a boon for efficient computation, it cannot be 
relied upon in sparse inference. Sensible uncertainty estimates like posterior covariance, at the heart 
of decision-making problems such as Bayesian experimental design, are eliminated alongside. 

We restrict ourselves to super-Gaussian SLM problems in terms of variables u and 7^0, relating 
the sparse Bayesian inference relaxation min-),;-o '/'(t) with two sparse estimation principles: maxi- 



mum a posteriori (MAP) reconstruction (10) and automatic relevance determination (ARD) [43], 
a sparse reconstruction method which inspired our algorithms. We begin by establishing a key dif- 
ference between these settings. Recall from Theorem [l]|4| that 7i = implie^ VarQ[sj|y] = 0: Si is 
eliminated, fixed at zero with absolute certainty. Exact sparsity in 7 does not happen for inference, 
while sparse estimation methods are characterized by fixing many ji to zero. 



^ Our experiments in Section [t] use different C++ and Fortran code, which differs from glm-ie mainly by being 
somewhat faster on large problems. 

® While the proof of Theorem lU4| holds for 7 ;^ 0, VarQ[sijy] is a continuous function of 7. 
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Theorem 4 Let X £ M™^", B € M'?''" be matrices such that A(7r) = a-'^X^X + B^{disig7T)B y 
for each iv >- 0, and no row of B is equal to 0^ . 

• The function log|A(7r)| is increasing in each component VTj, and unbounded above. For any 
sequence ivt with \\7Tt\\ — )• oo (t ^ oo) and ivt ^ el for some e > 0, we have that log | A(7rj)| — )• 
oo (t ^ oo). 

• Assume that log P{u\y) is bounded above as function of u. Recall the variational criterion 
4'il) from For any bounded sequence 'jt with {'yt)i (t ^ oo) for some i G {1, . . . , q}, 
we have that (pi'ft) — ^ co- 
in particular, any local minimum point 'y^ of the variational inference problem min-),;-o ^Pil) 
must have positive components: 7* ^ 0. 



A proof is given in Appendix A. 3 log|A| acts as barrier function for 7^0. Any local minimum 
point 7* of ([9]) is positive throughout, and VarQ[sj|y] > for alH = 1, . . . , Coefficient elimination 
does not happen in variational Bayesian inference. 



Consider MAP estimation ( 10 ) with even super-Gaussian potentials ti [si). Following [25] , a sufficient 
condition for sparsity is that — log tj(si) is concave for Si > 0. In this case, if rk X = m and rk B = n, 
then any local MAP solution is exactly sparse: no more than m coefficients of s* = Bu^, are 
nonzero. Examples are ij(sj) = e"'^''*'''', p G (0, 1], including Laplace potentials {p = 1). Moreover, 



7*.i = whenever = in this case (see Appendix A. 3). Local minimum points of SLM MAP 
estimation are substantially exactly sparse, with matching sparsity patterns of = Bu^, and 7*. 

A powerful sparse estimation method, automatic relevance determination (ARD) [43J, has in- 
spired our approximate inference algorithms developed above. The ARD criterion (/>ard is Q 
with h{'y) = l-^(log7), obtained as zero-temperature limit {v — )• 0) of variational inference with 
Student's t potentials ([3]). The function hi{'yi) is given in Appendix |A.6[ and hi{'^i) — )• log7j {v — )■ 0) 
if additive constants independent of 7, are droppedj^ ARD can also be seen as marginal likelihood 
maximization: (pAKoil) = —2 log J P{y\u)N{s\Q,T) du up to an additive constant. Sparsity pe- 
nalization is implied by the fact that the prior A^(s|0,r) is normalized (see Figure [3| left). The 
ARD problem is not convex. A provably convergent double loop ARD algorithm is obtained by 



employing bounding type B (Section 4.2), along similar lines to Section 4.1 we obtain 



min(/)ARD(7) = min ( miner ^Hj/ - + 2 V]'' zyf\si\ \ - g*2{z2). 

7)^0 Z2^0 \ U '4 = 1 ' J 

The inner problem is l\ penalized least squares estimation, a reweighted variant of MAP reconstruc- 
tion for Laplace potentials. Its solutions s=k = Sii* are exactly sparse, along with corresponding 

— 1/2 

7* (since 7*,j = I'**,*!)- ARD is enforcing sparsity more aggressively than Laplace MAP 
reconstruction [Hj. The log|A| barrier function is counterbalanced by /i(7) = l^(log7) = log|r|. 
If S = /, then 

log I A| + log |r| = log |/ + cr^^xrx^i ^ (7 ^ 0). 

The conceptual difference between ARD and our variational inference relaxation is illustrated in 
Figure [sj In sparse inference, Gaussian functions e"*?/^^'^'-'"'*'^''''^/^ lower bound ti{si). Their mass 
vanishes as 7^ — )■ 0, driving (p^-f) — )• 00. For ARD, Gaussian functions iV(sj |0,7j) are normalized, 
and 7i — )■ is encouraged. 



Note that the term dropped {d in Appendix A. 6 1 becomes unbounded as — >■ 0. Removing it is essential to 
obtain a well-defined problem. 
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Figure 3: Different roles of Gaussian functions (width 7j) in sparse estimation versus sparse in- 
ference. Left: Sparse estimation (ARD). Gaussian functions are normalized, there is incentive to 
drive 7^ — )• 0. Right: Variational inference for Laplace potentials Q. Gaussian functions are lower 
bounds of ti{si), their mass vanishes as 7^ — )• 0. No incentive to eliminate 7j. 



At this point, the roles of different bounding types introduced in Section [4.2| become transparent, 
log \A\ is a barrier function for 7 ;^ (Theoremk|, as is its type A bound zf{'y~^) —gl{zi), zi >- 
(see Figure [2]). On the other hand, log|A| + l^log7) is bounded below, as is its type B bound 
ZJ7— 5(2(2:2). These facts suggest that type A bounding should be preferred for variational inference. 



while type B bounding is best suited for sparse estimation. Indeed, experiments in Section 7.1 show 
that for approximate inference with Laplace potentials, type A bounding is by far the better choice, 
while for ARD, type B bounding leads to the very efficient algorithm just sketched. 

Sparse estimation methods eliminate a substantial fraction of 7's coefficients, variational inference 
methods do not zero any of them. This difference has important computational and statistical 
implications. First, exact sparsity in 7 is computationally beneficial. In this regime, even coordinate 
descent algorithms can be scaled up to large problems |39) . Within the ARD sparse estimation 
algorithm, variances z <r- diag~^ [BA^^B^) have to be computed, but since z is as sparse as 7, 
this is not a hard problem. Variational inference methods have to cope without exact sparsity. The 
double loop algorithms of Section [4] are scalable nevertheless, reducing to numerical techniques 
whose performance does not depend on the sparsity of 7. 

While exact sparsity in 7' implies computational simplifications, it also rules out proper model-based 
uncertainty quantificationj^If 7^ = 0, then VarQ[si|y] = 0. If Q{u\y) is understood as representation 
of uncertainty, it asserts that there is no posterior variance in Sj at all: Si is eliminated with absolute 
certainty, along with all correlations between Si and other sj. Sparsity in 7 is computationally useful 
only if most 7, = 0. Q{u\y), a degenerate distribution with mass only in the subspace corresponding 
to surviving coefficients, cannot sensibly be regarded as approximation to a Bayesian posterior. As 
zero is just zero, even basic queries such as a confidence ranking over eliminated coefficients cannot 
be based on a degenerate Q{u\y). 

In particular, Bayesian experimental design (Section [6]) cannot sensibly be driven by underlying 



sparse estimation technology, while it excels for a range of real-world scenarios (see Section 7.4) 



* Uncertainty quantification may also be obtained by running sparse estimation many times in a bootstrapping 
fashion [2T]. While such procedures cure some robustness issues of MAP estimation, they are probably too costly to 
run in order to drive experimental design, where dependencies between variables are of interest. 
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when based on a sparse inference method [361 IMl 1321 EH • The former approach is taken in [18] , 
employing the sparse Bayesian learning estimator [38] to drive inference queries. Their approach 
fails badly on real-world image data [32]. Started with few initial measurements, it identifies a 
very small subspace of non-eliminated coefficients (as expected for sparse estimation fed with little 
data), which it essentially remains locked in ever after. To sensibly score a candidate X^,, we have to 
reason about what happens to all coefficients, which is not possible based on a "posterior" Q(u\y) 
ruling out most of them with full certainty. 

Finally, even if the goal is point reconstruction from given data, the sparse inference posterior mean 
Eq [u I y] (obtained as byproduct it* in the double loop algorithm of Section [1]) can be an important 
alternative to an exactly sparse estimator. For the former, EQ[s|y] = Bu^: is not sparse in general, 
and the degree to which coefficients are penalized (but not eliminated) is determined by the choice 
of ti{si). To illustrate this point, we compare the mean estimator for Laplace and Student's t 
potentials (different i/ > 2) in Section 7.2 These results demonstrate that contrary to some folklore 
in signal and image processing, sparser is not necessarily better for point reconstruction of real- 
world images. Enforcing sparsity too strongly leads to fine details being smoothed out, which is not 
acceptable in medical imaging (fine features are often diagnostically most relevant) or photography 
postprocessing (most users strongly dislike unnaturally hard edges and oversmoothed areas). 

Sparse estimation methodology has seen impressive advancements towards what it is intended to 
do: solving a given overparameterized reconstruction problem by eliminating nonessential variables. 
However, it is ill-suited for addressing decision-making scenarios driven by Bayesian inference. For 
the latter, a useful (nondegenerate) posterior approximation has to be obtained without relying on 
computational benefits of exact sparsity. We show how this can be done, by reducing variational 
inference to numerical techniques (LCG and Lanczos) which can be scaled up to large problems 
without exact variable sparsity. 



6 Bayesian Experimental Design 

In this section, we show how to optimize the image acquisition matrix X by way of Bayesian 
sequential experimental design (also known as Bayesian active learning), maximizing the expected 
amount of information gained. Unrelated to the output of point reconstruction methods, information 
gain scores depend on the posterior covariance matrix Cov[u|i/] over full images u, which within 
our large scale variational inference framework is approximated by the Lanczos algorithm. 

In each round, a part X^, G M*^^" is appended to the design X, a new (partial) measurement i/* to y. 
Candidates {X^} are ranked by the information ^rairj^ score H[P(ti|y)] — Ep(^y^\y^[Il[P{u\y,y^)]], 
where P{u\y) and P(ii|y,i/*) are posteriors for {X,y) and (X U X*, y U y*) respectively, and 
P{y*\y) = J N{y^:\X^u, a'^I)P{u\y) du. Replacing P{u\y) by its best Gaussian variational ap- 
proximation Q{u\y) = N{u*,A~^) and P{u\y,y*) by Q{u\y,y^) oc N{y^\X^u,a'^I)Q{u\y), we 
obtain an approximate information gain score 

A(X,) := -log|A| +log|A + (7-2xJX,| = log |7 + tj-^x, A-^Xj] . (13) 

Note that Q{u\y, y*) has the same variational parameters 7 as Q{u\y), which simplifies and robus- 
tifies score computations. Refitting of 7 is done at the end of each round, once the score maximizer 
is appended along with a new measurement y^: . 

^ H[P(u)] — Ep[— logP(n)] is the (differential) entropy, measuring the amount of uncertainty in P{u). For a 
Gaussian, H[iV(/i, S)] = | log |27reS|. 
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With candidates of size d to be scored, a naive computation of (13) would require • d linear 



systems to be solved, which is not tractable (for example, A'^ = 240, d = 512 in Section 7.4). We 



can make use of the Lanczos approximation once more (see Section If QlAQk = Tk = L^L 



-T 

{Lk is bidiagonal, computed in 0{k)), let := a~^X^QkL^^ £ I^^. Then, A(X*) log|I + 
K.!^"^! = log|/ + ^."^141 (the latter is preferable if /c < d), at a total cost of k matrix- vector 
multiplications (MVMs) with and 0(max{/c,(i} • min{/c,d}^). Just as with marginal variances, 
Lanczos approximations of A{X^:) are underestimates, nondecreasing in k. The impact of Lanczos 
approximation errors on design decisions is analyzed in [31]. While absolute score values are much 
too small, decisions only depend on the ranking among the highest-scoring candidates X^,, which 
often is faithfully reproduced even for k <^ n. To understand this point, note that A{X^:) measures 
the alignment of X^ with the directions of largest variance in Q{u\y). For example, the single 
best unit-norm filter a;* S is given by the maximal eigenvector of CovQ[ti|7/] = A~^, which is 
obtained after few Lanczos iterations. 

In the context of Bayesian experimental design, the convexity of our variational inference relax- 
ation (with log-concave potentials) is an important asset. In contrast to single image reconstruction, 
which can be tuned by the user until a desired result is obtained, sequential acquisition optimiza- 
tion is an autonomous process consisting of many individual steps (a real-world example is given 



in Section 7.4), each of which requires a variational refitting Q(u\y) — )• Q{u\y,y^,). Within our 
framework, each of these has a unique solution which is found by a very efficient algorithm. While 
we are not aware of Bayesian acquisition optimization being realized at comparable scales with 
other inference approximations, this would be difficult to do indeed. Different variational approx- 
imations are non-convex problems coming with notorious local minima issues. For Markov chain 
Monte Carlo methods, there are not even reliable automatic tests of convergence. If approximate 
inference drives a multi-step automated scheme free of human expert interventions, properties like 
convexity and robustness gain relevance normally overlooked in the literature. 



6.1 Compressive Sensing of Natural Images 



The main application we address in Section 7.4, automatic acquisition optimization for magnetic 



resonance imaging, is an advanced real- world instance of compressive sensing (CS) [6l [5]. Given 
that real-world images come with low entropy super-Gaussian statistics, how can we tractably 
reconstruct them from a sample below the Nyquist- Shannon limit? How do small successful designs 
X for natural images look like? Recent celebrated results about recovery properties of convex sparse 
estimators |IOl El |5] have been interpreted as suggesting that up from a certain size, successful 
designs X may simply be drawn blindly at random. Technically speaking, these results are about 
highly exactly sparse signals (see Sectionjs]), yet advancements for image reconstruction are typically 
being implied P, "5]. In contrast, Bayesian experimental design is an adaptive approach, optimizing 
X based on real-world training images. Our work is of the latter kind, as are \18\ [321 [T6] for much 
smaller scales. 

The question whether a design X is useful for measuring images, can (and should) be resolved 
empirically. Indeed, it takes not more than some reconstruction code and a range of realistic images 
(natural photographs, MR images) to convince oneself that MAP estimation from a subset of 
Fourier coefficients drawn uniformly at random (say, at 1/4 Nyquist) leads to very poor results. 
This failure of blindly drawn designs is well established by now both for natural images and MR 
images |32t l34 t[T9l [7]. and is not hard to motivate. In a nutshell, the assumptions which current CS 
theory relies upon do not sensibly describe realistic images. Marginal statistics of the latter are not 
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exactly sparse, but exhibit a power law (super-Gaussian) decay. More important, their sparsity is 
highly structured, a fact which is ignored in assumptions made by current CS theory, therefore not 
reflected in recovery conditions (such as incoherence) or in designs X drawn uniformly at random. 
Such designs fail for a number of reasons. First, they do not sample where the image energy is 
|321 [7]. A more subtle problem is the inherent variability of independent sampling in Fourier space: 
large gaps occur with high probability, which leads to serious MAP reconstruction errors. These 
points are reinforced in |32^ 134) . The former study finds that for good reconstruction quality of 
real-world images, the choice of X is far more important than the type of reconstruction algorithm 
used. 

In real-world imaging applications, adaptive approaches promise remedies for these problems (other 
proposals in this direction are [15] and which however have not successfully been applied to 
real- world images). Instead of relying on simplistic signal assumptions, they learn a design X 
from realistic image data. Bayesian experimental design provides a general framework for adaptive 
design optimization, driven not by point reconstruction, but by predicting information gain through 
posterior covariance estimates. 



7 Experiments 

We begin with a set of experiments designed to explore aspects and variants of our algorithms, and 
to understand approximation errors. Our main application concerns the optimization of sampling 
trajectories in magnetic resonance imaging (MRI) sequences, with the aim of obtaining useful 
images faster than previously possible. 



7.1 Type A versus Type B Bounding for Laplace Potentials 

Recall that the critical coupling term log | A| in the variational criterion ^(7) can be upper bounded 



in two different ways, called type A and type B in Section 4.2 Type A is tight for moderate and large 
7j, type B for small 7, (Section [5]). In this section, we run our inference algorithm with type A and 
type B bounding respectively, comparing the speed of convergence. The setup (SLM with Laplace 
potentials) is as detailed in Section [7^ with a design X of 64 phase encodes (1/4 Nyquist). Results 
are given in Figure [4| averaged over 7 different slices from sg88 (256 x 256 pixels, n = 131072). 

In this case, the bounding type strongly influences the algorithm's progress. While two outer loop 
(OL) iterations suffice for convergence with type A, convergence is not attained even after 20 OL 
steps with type B. More inner loop (IL) steps are done for type A (30 in ffist OL iteration, 3-4 
afterwards) than for type B (5-6 in first OL iteration, 3-4 afterwards). The double loop strategy, 
to make substantial progress with far less expensive IL updates, works out for type A, but not 
for type B bounding. These results indicate that bounding type A should be preferred for SLM 
variational inference, certainly with Laplace potentials. Another indication comes from comparing 
IL penalties h*{si) respectively. For type A, h*{si) = n^zi^i + sf)^^"^ is sparsity-enforcing for small 
zi^i, retaining an important property of (pi'j), while for type B, h*{si) does not enforce sparsity at 



all (see Appendix A. 6) 
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Figure 4: Comparison of bounding types A, B for SLM with Laplace potentials. Shown are ^(7) criterion 
values (left) and £2 errors of posterior mean estimates (not MAP, as in Section 7.4), at the end of each outer 
loop iteration (starting from second). 



7.2 Student's t Potentials 

In this section, we compare SLM variational inference with Student's t ^ potentials to the Laplace 
setup of Section 7.1 Student's t potentials are not log-concave, so neither MAP estimation nor 
variational inference are convex problems. Student's t potentials enforce sparsity more strongly than 
Laplacians do, which is often claimed to be more useful for image reconstruction. Their parameters 
are v (degrees of freedom; regulating sparsity) and a = vjr (scale). We compare Laplace and 
Student's t potentials of same variance (the latter has a variance for > 2 only): = 2(i/-2)/r2, 
where is the Laplace parameter, a^, respectively. The model setup is the same as in Section[7T} 
using slice 8 of sg88 only. Result are given in Figure [5j 

Compared to the Laplace setup, reconstruction errors for Student's t SLMs are worse across all 
values of v. While v = 2.1 outperforms larger values, the reconstruction error grows with iterations 
for = 2.01, I' = 2.001. This is not a problem of sluggish convergence: 0(7) decreases rapidljj^ 
in this case. A glance at the mean reconstructions (|u=i,^j|) (Figure [5| lower row) indicates what 
happens. For v = 2.01, 2.001, image sparsity is clearly enforced too strongly, leading to fine features 
being smoothed out. The reconstruction for v = 2.001 is merely a caricature of the real image 
complexity, and rather useless as the output of a medical imaging procedure. When it comes to 
real-world image reconstruction, more sparsity does not necessarily lead to better results. 



7.3 Inaccurate Lanczos Variance Estimates 



The difficulty of large scale Gaussian variance approximation is discussed in Section 4.4, In this 
section, we analyse errors of the Lanczos variance approximation we employ in our experiments. 
We downsampled our MRI data to 64 x 64, to allow for ground truth exact variance computations. 
The setup is the same as above (Laplacians, type A bounding), with X consisting of 30 phase 
encodes. Starting with a single common OL iteration, we compare different ways of updating zi: 



For Student's t potentials (as opposed to Laplacians) 
experiments. 



type A and type B bounding behave very similar in these 
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Figure 5: Comparison of SLM with Student's t and Laplace potentials (type A bounding). Shown are 
potential density functions (upper left), £2 errors of posterior mean estimates (upper right), over 8 outer 
loop iterations. Lower row: Posterior mean reconstructions (|w*,i|) after 8 OL iterations: v — 2.1; ground 
truth; V = 2.01; v ^ 2.001. 



exact variance computations versus Lanczos approximations of different size k. Results are given 
in Figure [6] (upper and middle row) . 

The spectrum of A at the beginning of the second OL iteration shows a roughly linear decay. Lanc- 
zos approximation errors are rather large (middle row). Interestingly, the algorithm does certainly 
not work better with exact variance computations (judged by the development of posterior mean 
reconstruction errors, upper right). We offer a heuristical explanation in Section 4.4 A clear struc- 
ture in the relative errors emerges from the middle row: the largest (and also smallest) true values 
Zi are approximated rather accurately, while smaller true entries are strongly damped. The role of 
sparsity potentials ti(si), or of 7^ within the variational approximation, is to shrink coefficients se- 
lectively. The structure of Lanczos variance errors serves to strengthen this effect. We repeated the 
relative error estimation for the full-scale setup used in the previous sections and below (256 x 256), 
ground truth values Zi were obtained by separate conjugate gradients runs. The results (shown in 
the lower row) exhibit the same structure, although relative errors are larger in general. 

Both our experiments and our heuristic explanation are given for sparse linear model inference, 
we do not expect them to generalize to other models. Within the same model and problem class, 
the impact of Lanczos approximations on final design outcomes is analyzed in |3T]- As noted in 



Section 4.4, understanding the real impact of Lanczos (or PCA) approximations on approximate 
inference and decision-making is an important topic for future research. 
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Figure 6: Lanczos approximations of Gaussian variances, at beginning of second OL iteration. For 64x64 data 
(upper left), spectral decay of inverse covariance matrix A is roughly linear (upper middle). £2 reconstruction 
error of posterior mean estimate after subsequent OL iterations, for exact variance computation vs. k = 
250,500,750,1500 Lanczos steps (upper right). Middle row: Relative accuracy Zi 1-^ Zk.i/zi at beginning of 
second OL iteration, separately for "a" potentials (on wavelet coefficients; red), "r" potentials (on derivatives; 
blue), and "i" potentials (on 3(m); green), see Section 7.4 Lower row: Relative accuracy Zi 1— > z^^i/zi 
at beginning of second OL iteration, for full size setup (256 x 256), k = 500,750,1500 (ground truth z^ 
determined by separate LCG runs). 



7.4 Sampling Optimization for Magnetic Resonance Imaging 

Magnetic resonance imaging [35] is among the most important medical imaging modalities. Without 
applying any harmful ionizing radiation, a wide range of parameters, from basic anatomy to blood 
flow, brain function or metabolite distribution, can be visualized. Image slices are reconstructed 
from coefficients sampled along smooth trajectories in Fourier space {phase encodes). In Cartesian 
MRI, phase encodes are dense columns or rows in discrete Fourier space. The most serious limiting 



factoi is long scan time, which is proportional to the number of phase encodes acquired. MRI is 



a prime candidate for compressive sensing (Section 6.1) in practice [T9l[3l]: if images of diagnostic 
quality can be reconstructed from an undersampled design, time is saved at no additional hardware 
costs or risks to patients. 



Patient movement (blood flow, heartbeat, thorax) is strongly detrimental to image quality, which necessitates 
uncomfortable measures such as breath-hold or fixation. In dynamic MRI, temporal resolution is limited by scan 
time. 
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In this section, we address the problem of MRI samphng optimization: which smallest subset of 
phase encodes results in MAP reconstructions of useful quality? To be clear, we do not use ap- 
proximate Bayesian technology to improve reconstruction from fixed designs (see Section [5]) , but 
aim to optimize the design X itself, so to best support subsequent standard MAP reconstruction 
on real- world images. As discussed in Section [5] and Section 6.1, the focus for most work on com- 



pressive sensing is on the reconstruction algorithm, the question of how to choose X is typically 
not addressed (exceptions include \18\ I16j). We follow the adaptive Bayesian experimental design 
scenario described in Section |6] where {-X^*} are phase encodes (columns in Fourier space), u the 
unknown (complex- valued) image. Implementing this proposal requires the approximation of dom- 
inating posterior covariance directions for a large scale non-Gaussian SLM (n = 131072), which 
to our knowledge has not been attempted before. Results shown below are part of a larger study 
|34| on human brain data acquired with a Siemens 3T scanner (TSE, 23 echos/exc, 120° refocus- 
ing pulses, 1 x 1 x 4mm^ voxels, resolution 256 x 256). Note that for Nyquist dense acquisition, 
resolution is dictated by the number of phase encodes, 256 in this setting. We employ two datasets 
sg92 and sg88 here (sagittal orientation, echo time «90ms). 

We use a sparse linear model with Laplace potentials ([2]). In MRI, u, y, X, and s = Bu are 



naturally complex- valued, and we make use of the group potential extension discussed in Section 4.5 
(coding C as M^). The vector s is composed of multi-scale wavelet coefficients Sa, first derivatives 
(horizontal and vertical) s^, and the imaginary part Sj = '^[u). A matrix- vector multiplication 
(MVM) with X requires a fast Fourier transform (FFT), while an MVM with B costs 0{n) only. 
Laplace scale parameters were = 0.07, t^. = 0.04, Tj = 0.1). The algorithms described above 
were run with n = 131072, q = 261632, candidate size d = 512, and m = d ■ N^oi, where N^oi 
is the number of phase encodes in X. We compare different ways of constructing designs X, all 
of which start with the central 32 columns (lowest horizontal frequencies): Bayesian sequential 
optimization, with all remaining 224 columns as candidates (op); filling the grid from the center 
outwards (ct; such low-pass designs are typically used with linear MRI reconstruction); covering 
the grid with equidistant columns (eq); and drawing encodes at random (without replacement), 
using the variable-density sampling approach of [19] (rd). The latter is motivated by compressive 



sensing theory (see Section 6.1), yet is substantially refined compared to naive i.i.d. samplingj^ 
Results for sparse MAP reconstruction of the most difficult slice in sg92 are shown in Figure [T] (the 
error metric is £2 distance — |wtrue|||, where iitrue is the complete data reconstruction). 

Obtained with the same standard sparse reconstruction method (convex ii MAP estimation) , results 
for fixed A'coi differ "only" in terms of the composition of X (recall that scan time grows proportional 
to Nco\)- Designs chosen by our Bayesian technique substantially outperform all other choices. These 
results, along with |32|, I34j. are in stark contrast to claims that independent random sampling is a 
good way to choose designs for sub-Nyquist reconstruction of real-world images. The improvement of 
Bayesian optimized over randomly drawn designs is larger for smaller A^coi- In fact, variable-density 
sampling does worse than conventional low-pass designs below 1/2 Nyquist. Similar findings are 
obtained in |32j for different natural images. In the regime far below the Nyquist limit, it is all the 
more important to judiciously optimize the design, using criteria informed about realistic images 
in the first place. 

A larger range of results is given in [34j . Even at 1/4 Nyquist, designs optimized by our method lead 
to images where most relevant details are preserved. In Figure [7| testing and design optimization 



Results for drawing phase encodes uniformly at random are much worse than the alternatives show, even if 
started with the same central 32 columns. Reconstructions become even worse when Fourier coefficients are drawn 



uniformly at random. 
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Figure 7: Results for Cartesian undersampling with different measurement designs, on sagittal slice 
(TSE, TE=92ms). All designs contain 32 central columns. Equispaced [eq]; low-pass [ct]; random 
with variable density [rd] (averaged over 10 repetitions); optimized by our Bayesian technique [op]. 
Shown are ii distances to tttrue for MAP reconstruction with the Laplace SLM. Designs optimized 
on the same data. 

is done on the same dataset. The generalization capability of our optimized designs is tested in 
this larger study, applying them to a range of data from different subjects, different contrasts, and 
different orientations, achieving improvements on these test sets comparable to what is shown in 
Figure [7| Finally, we have concentrated on single image slice optimization in our experiments. In 
realistic MRI experiments, a number of neighbouring slices is acquired in an interleaved fashion. 
Strong statistical dependencies between slices can be exploited, both in reconstruction and joint 
design optimization, by combining our framework with structured graphical model message passing 

8 Discussion 

In this paper, we introduce scalable algorithms for approximate Bayesian inference in sparse linear 
models, complementing the large body of work on point estimation for these models. If the Bayesian 
posterior is not just taken for a criterion to be optimized, but as global picture of uncertainty in a 
reconstruction problem, advanced decision-making problems such as model calibration, feature rele- 
vance ranking or Bayesian experimental design can be addressed. We settle a long-standing question 
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for continuous-variable variational Bayesian inference, proving that the relaxation of interest here 
|17l [23t [TT] has the same convexity profile than MAP estimation. Our double loop algorithms are 
scalable by reduction to common computational problems, penalized least squares optimization and 
Gaussian covariance estimation (or principal components analysis). The large and growing body 
of work for the latter, both in theory and algorithms, is put to novel use in our methods. More- 
over, the reductions offer valuable insight into similarities and differences between sparse estimation 
and approximate Bayesian inference, as do our focus on decision-making problems beyond point 
reconstruction. 

We apply our algorithms to the design optimization problem of improving sampling trajectories for 
magnetic resonance imaging. To the best of our knowledge, this has not been attempted before in 
the context of sparse nonlinear reconstruction. Ours is the first approximate Bayesian framework 
for adaptive compressive sensing that scales up to and succeeds on full high-resolution real-world 
images. Results here are part of a larger MRI study [HI], where designs optimized by our Bayesian 
technique are found to significantly and robustly improve sparse image reconstruction on a wide 
range of test datasets, for measurements far below the Nyquist limit. 

In future work, we will address advanced joint design scenarios, such as MRI sampling optimization 
for multiple image slices, 3D MRI, and parallel MRI with array coils. Our technique can be sped 
up along many directions, from algorithmic improvements (advanced algorithms for inner loop 
optimization, modern Lanczos variants) down to parallel computation on graphics hardware. An 
important future goal, currently out of reach, is supporting real-time MRI applications by automatic 
on-line sampling optimization. 
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A Details and Proofs 
A.l Proof of Theorem [l] 

In this section, we provide a proof of Theorem [l] For notational convenience, we absorb o"^^ into 
X^X, by replacing X by a^^X. We begin with part ([2|. It is well known that tt i— t- log | A(7r)| is 
concave and nondecreasing for tt ;^ f5I Sect. 3.1.5]. Both properties carry over to the extended- 



value function, ^'^ The statement follows from the concatenation rules of |31 Sect. 3.2.4]. 

We continue with part ([l|. Write A = A{f {-/)), ipi := log] A], T = diag7, and /(F) = 
diag/(7). First, 7 i— -(/'i is the composition of twice continuously differentiable mappings, 
thus inherits this property. Now, dipi = tv Sf'{T){dT), where S := BA , moreover 
fi2^i = -trSf{T){dT)Sf'{T){dT) + ti Sf"{T){dT)^ = tr{dT)S{dT)Ei, where Ei := f"{T) - 



In general, we extend convex continuous functions /(tt) on tt ^ by /(tt) = limd\,7r f{d), tt 0, and /(tt) = 00 
elsewhere. 
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f'{T)Sf{T). Since 5 ^ 0, we have S = for some matrix V, and d^V'i = tT{{dT)V)^ Ei{dT)V. 
Now, if El >z 0, then for 7^*) = 7 + *(A7), we have Vi(0) = tiN'^EiN > for any A7, where 
AT" := (diag A7)V^ so that -01 is convex. 

The log-convexity of fi{ji) imphes that /i(7i)/j"(7i) > (/i(7i))^ ^^i all ji, so that 

£;i = f(T)~\fiT)f"(T)) - /'(r)5/'(r) /(r)-i(/'(r))2 - /'(r)5/'(r) 
= /(r) (/(r)-i - s) f'{T). 

Therefore, it remains to be shown that f{T)-^ - ShO.We use the identity 

v'^M'^v = max2i>^a; - x^Mx, (14) 



which holds whenever M >- 0. For any r E M"^: 

-1 



r^BA B^r = max2r^Bx - x^ {X'^X + B^f{T)B) x < max 2r^fc - k^f{T)k, 



k=Bx 



using (h4) and x^ X^ Xx = \\Xx\y > 0. Therefore, Sr < max^ 2r^k-k^ /(r)fe = /(r)~V 



using (|14 ) once more, which implies /(F) ^ — S ^ 0. This completes the proof of part Since 
A = A(7~^), we can employ this argument with fii'ji) = and r = Si in order to establish part 

We continue with part Write A = A(/(7)~-^) and ip2 ■= l"^(log/(7)) + log|A|. Assume for 
now that X'^X >- 0. Let B = (B'^^b)^ (so that is the last row of B), and define A^g = 
X^X + B^J(T<g)-^B<g, where /(7<q) = (/i(7i))i<q e M^~\ We make use of the well-known 
determinant identity |/ + vv'^\ = 1 -|- v'^v. Namely, 

log + log A<, + Mi.r'bb'' 



l0g/g(7g) +l0g 



+ log 



log 
log 



'■«7 



I + fgi^grHA^g)-W 

+ log/,(7g) + log (1 + /g(7,)-'b^(A<g)-i6 
+ log (/,(7,) + fa^(A<,)-ib). 



(15) 



Since the extended-value function log(-) (assigning —00 to arguments < 0) is concave and nonde- 
creasing, the concatenation rules of Sect. 3.2.4] imply the concavity of the final term in (15) 
whenever fqi'jq) + b'^(A<„)~^b is concave. We will use induction on q, the dimensionality of 7. 
For q = 1, 1^2 is given by (15) with A<i = X'^X, and its concavity follows from the concavity of 
/i(7i). For q > 1, (15) implies 



i;2 = 1' (log /(7<«)) + log I A<,| + log ^fgijg) + b' (A<,)-^6j . 

Both the sum of the first two terms and fqi'jq) are concave by assumption, so that the concavity 
of Tp2 is implied by the concavity of 7 1— )• b^{A^g)~^b. Using (14), we have 



b^{A^q) ^b = max 26^03 — x'^A^gX = max2b^x — \\Xx\ 



with V := B^gX. Now, {x, f) 1— )• 2b^x — WXxW^ — v^{diagf) is jointly concave for / ^ (see 
proof of Theorem^, so that k(/) := 6^A<g(/~^)~^6 is concave for / ^ [H Sect. 3.2.5] (recall 
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that A<q(/~^) = X'^X + S^g(diag /~^)S<q). To finish the argument, we plug in / := f{'y<,q) 
and use the concatenation theorems of [U Sect. 3.2.4]. What remains to be shown in this context is 
that is nondecreasing in each argument. Pick any i G {1, . . . , q}, f y 0, and any A > 0. Then, 

K{f+ASi) = (A<,(ri) - -j-^^-^^h,hj^ h > h^A^.if-^r^b = 

where hi = B^5i. This concludes the proof of part ([s]), under the assumption that X^X is invert- 
ible. If X^X is singular, define ip2 above, but with X^ X — t- X"^ X + el. We saw that t/'f 
concave for any e > 0. For any 7 ;^ such that ^^2(7) > —00, V'l converges uniformly to ^^2 on a 
closed environment of 7 (^2 and all are continuous), so that ip2 is concave at 7. This completes 
the proof of part ([s]). 



A.2 Proof of Theorem |3] 

In this section, we provide of proof of Theorem [sj We begin with part ([T]), focussing on a single 
potential i and dropping its index. Since 5 7^ is dealt with separately, assume that t{s) is even: 
logt(s) = g{s) = g{s^) = g{x), where x = s^. If /(x,7) := — x/7 — 2g{x) and f{s,'y) := — s^/7 — 
2g{s), then h{-y) = maxx>o f {x , 'j) , and f{x,^) = f{x^^'^,j). It suffices to consider s > 0. Denote 
x^: = x=k(7) := argmax^.>o 7(3;, 7) (unique, since g{x) is strictly convex). If 70 := sup{7 | /(x,7) < 
—2g{0) Vx} (70 = for an empty set), then 

• X* = 0, h{j) = -2g{0) for 7 G (0,70] 

• x=K > 0, h{'y) strictly increasing for 7 > 70 

Namely, if 70 < 71 < 72, then x*(7i) > by definition of 70, and /i(7i) = — x*(7i)/7i — 2g(x=i,(7i)) < 
-2;*(7i)/72 - 25(x*(7i)) < -x*(72)/72 - 2g{x^{j2)) = Hl^)- Note that 70 > iff lim£^o5'(^) is 
finite. It suffices to show that h is convex at all 7 > 70, where x* = > 0. 

We use the notation fs = df/{ds), functions are evaluated at {x^ = 5^,7) if nothing else is 
said. Now, fs = —2s^:/j — 2^s(s*) = 0, so that 5s(s*) = — s*/7. Next, g{x) is twice continuously 
differentiable, and x* = si at 7. Therefore, fx = df/{dx) is continuously differentiable. Moreover, 
9x,x{x) > by the strict convexity of ^(x). By the implicit function theorem, x=i,(7) is continuously 
differentiable at 7, and since /i(7) = /(x*(7),7), h'{^) exists. Moreover, = {d/d^)fx{x^:{'y),^) = 
fx,'r + fx,x ■ idx^)/{d'y), so that {dx^)/{d-i) = -f'"^ /{2gx^xix*)) > 0: x^,(7) is increasing. From fx = 0, 
we have that h^-y) = f^y = = (5s(s*))^, since ffs(s*) = —s^/'-f. Now, gs{s) is nonincreasing 

by the concavity of g{s), and ffs(s=i,) < 0, so that s* 1— h^j) is nondecreasing. Since = x* is 
increasing in 7, so is s*. Therefore, 7 1— )• h'('y) is nondecreasing, which means that h{'y) is convex 
for 7 > 7o. 

The concavity of g{s) is necessary. Suppose that 5s,s(s) > for some s > 0. If x = s^/^, g{x) is 
differentiable at x, and if 7 = —l/{2g'{x)), then 5=1,(7) = s. But if gs,s{s*) > at 7, then s* 1— h^j) 
is decreasing at s* = s, and just as above 7 1— )• h'{'y) is decreasing at 7, so that h is not convex at 
7. This concludes the proof of part ([l]). 

Part ^ is a direct consequence of part ([T]) and Theorem [2j For the final statement, suppose that 
h[{'yi) is decreasing at 7j = 7i. Pick the other coefficients in 7 7 arbitrary, and choose m = n = 1, 
y = 0, X = X, B = Si, so that (t>{'y) — h{'y) = r{ji) := log(l + X'^'ji) — log7i, ignoring additive 
constants. Consider (f>{t) = ^(7 + tSi). Since r'{^i) = X^/{1 + X'^^i) X ^ 00, cf)'{t) 

is decreasing at t = for large enough X, and (j) is not convex at 7. 
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A. 3 Proof of Theorem |4] 



In this section, we provides proofs related to Section[5] We begin with Theorem[4} For the first part, 
fix any n >- 0, any i G {1, . . . ,q}, and any A > 0. If 6j = B^6i / 0, then using the determinant 
identity previously employed in Appendix |A.1[ we have 

log I A(7r + ASi)\ = log I A(7r)| + log 1/ + AA{7v)-\bf 

= log I A(7r)| + log(l + AbfA{7v)-\) > log | A(7r)|, 

since bj A{Tv)~^bi > and log(l + x) > for x > 0. Therefore, log|A(7r)| is increasing in each 
component. Moreover, we have that log |A(7r + A(5j)| — )• oo (A — )• oo), since log(l + x) is unbounded 
above for x — )• oo. If tt^ is a sequence with ||7rt|| — )• oo and tt^ ^ el, there must be some i £ {1, . . . , q} 
such that {7Vt)i — oo. If tt^ := el + {{7Vt)i — e)Si >: nt, then log | A(7rt)| > log | A(7rt)| — >• oo (t — )■ oo). 

For the second part, recall that 

0(7) = loglAI + hH) + mm \R(u,j) = a~'^\\y - Xuf + s'^T~'^s - 2b^s] , 
— 2 log P(u\y) = min /i(7) + R{u,j) + C2, s = Bu 

for some constant C2. If "ft is a bounded sequence such that {'yt)i — (t — )• 00) for some 
i G {!,..., g}, then log|A(7j)| = log|A(7j~^)| — )• 00. Suppose that 0(7t) remains bounded 
above. Let Ut = aigmin^ R{u , -ft) . Then, (/>(7t) — log|A(7f)| = /i(7j) + R{ut,jt) — ^ —00, so that 
—2log P{ut\y) — C2 < h{-yt) + R{ut^ft) — ^ —00, in contradiction to the boundedness of the log 
posterior. This concludes the proof. 



Next, assume we run MAP estimation (10) with even super-Gaussian potentials tt(si), so that 
\si\ I-)- — logtj(sj) = —gi{si) is concave. As argued in Section[5| any local minimum point = Bu^, 
is exactly sparse. We show that the corresponding 7* has the same sparsity pattern: 7^,^j = 
whenever s*^j = 0. Dropping the index, since 7=,, G argmin^>o ■^^/t + ^(7)1 we have to show that 



/i(7) > /i(0) for all 7 > (or, in terms of Appendix A.2| that 70 = 0). Fix 7 > 0, and recall that 



/i(7) = max,>o{/(s, 7) = -s'h " 25(^)1 > -25(0) = /i(0). Now, f{s, 7) = -2~g{s) + O(s^), s \ 0, 
where —2g{s) is concave, nondecreasing and not constant. Therefore, \\ms\Qdf/{ds) G (0,oo], and 
/(s, 7) > /(0,7) for some s > 0, so that h{^) > /(s, 7) > h{0). 



A. 4 Details for Bounding Iog|A| 



In this section, we provide details concerning the log|A| bounds discussed in Section 4.2 Recall 
that A(7r) = a^'^X'^X + B"^(diag7r)S for tv >- 0. Define the extended- value extension 51(77) = 
limd\7r log I A(d)|, TT ^ 0, gi{Tv) = —00 elsewhere (note that log|A(7r)| is continuous). Since gi 
is lower semicontinuous, and concave for n >~ (Theorem [T]|2]) ) , it is a closed proper concave 
function. Fenchel duality [261 Sect. 12] implies that (71(77) = inf^^ zfn — gl{zi), where gl{zi) = 
inf^ zJtt — gi{7v) is closed concave as well. As gi{Tv) is unbounded above as ||7r|| — )• 00 (TheoremjZj), 
zfn — gi{7v) is unbounded below whenever zi^i < for any i, and gi{zi) = —00 in this case. 



Moreover, for any tv >- 0, the corresponding minimizer zi^^ is given in Section 4.2, so that gi{7v 



mm^-^yozf-K - gl[zij 



Second, define the extended-value extension 52(7) = limd\7 + log |A(<i)|, 7^0, 52(7) = 

—00 elsewhere (note that l"^(log7r) + log|A(7r)| is continuous). Since 52 is lower semicontinuous. 
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and concave for 7^0 (Theorem [T]|3])), it is a closed proper concave function. Fenchel duality 
[26l Sect. 12] implies that ^'2(7) = infz2 2:^7 — ^2(^2), where 52(^2) = inf^ 2;|"7 — 92(7) is closed 
concave as well. Since — 52(7) is unbounded below whenever Z2,i < for any i, we see that 
52 (-^2) = —00 in this case. For any 7 )^ 0, the corresponding minimizer 2:2,* is given in Section 
so that 52(7) = miiizabo ^2^7 - 52(22)- 

A. 5 Implicit Computation of hi and h\ 



4.2 



Recall from Section |4] and Section 4.3 that our algorithms can be run whenever h*{si) and its 



derivatives can be evaluated. For log-concave potentials, these evaluations can be done generically, 
even if no closed form for /ij(7i) is available. We focus on a single potential i and drop its index. 
As noted in Section [4| if Z2 = zs = 0, then h*{s) = —g{zi + s^). With p := zi + s^, we have that 



-2g'{p)s-b, p = -Ag"{p)s'^ -2g'{p). With a view to Appendix |A.7| 9 = -2g'{p), p = zi + \\s\ 
and K = 2[5"(p)]V2. 

If Z2 7^ (and t{s) log-concave), we have to employ scalar convex minimization. We require h*{s) = 
^ min-y k{x, 7), /c := {zi+x)/^+Z2j — Z3 log7 + /i(7), x = s^, as well as 9 = {h*)'{s) and p = {h*)"{s). 
Let 7=f = argmin A;(x, 7). Assuming for now that h and its derivatives are available, 7* is found 
by univariate Newton minimization, where 7^/2^ = —{zi + x) — Z37 + 7^(2^2 + ^'(7))) 7^^7,7 = 
2{zi -|- x) + 7^3 + 7^/i"(7). Now, A:^ = (always evaluated at (x,7=k)), so that 9 = (/i*)'(s) = 5/7=,,. 
Moreover, = {d/ds)k^ = ks^y + k^^'y- {d'y^)/{ds), so th at p = {h*)"{s) = {I — s^^^ {d'y^) / (ds)) 



7^ ^(1 — 2x/(7f^,^(x,7*))). With a view to Appendix A. 7, 9 = I/7* and k = [2/(7^A;^^^(x, 7*))]-*^/^ 
(note that 9 > p). 

By Fenchel duality, /i(7) = — min^; Z(x, 7), / := x/7 + 2g{x), where g{x) is strictly convex and 
decreasing. We need methods to evaluate g{x) and its first and second derivative (note that g"{x) > 
0). The minimizer x=k = x*(7) is found by convex minimization once more, started from the last 
recently found x* for this potential. Note that x=k = iff 7 < 70 := —l/{2g'{0)) (where 70 = 
if g'{x) — )■ —00 as X —7- 0), which has to be checked up front. Given x,,,, we have that 7/1(7) = 
— x^, — 2jg{x^:). Since Z^, = for 7 > 70 (always evaluated at (x*,7)), then 7^/1' (7) = —7^/7 = x^: 
(this holds even if Z^; > and x* = 0). Moreover, if x* > (for 7 > 70), then {d/d'y)lx{x*,y) = 0, 
so that {dx^)/{d-f) = 'y-^/(2g"{x^)), and 7^/1" (7) = (2751" (x,^))^^ - 2x=^. If x* = and l^c > 0, then 
x=k(7) = for 7 close to 7, so that /i"(7) = 0. A critical case is x* = and Ix = 0, which happens 
for 7 = 7o > 0: h"{y) does not exist at this point in general. This is not a problem for our code, 
since we employ a robust Newton/bisection search for 7*. If 7 > 7*, but is very close, note that 
(dx,)/(d7) « eo/7 with ^0 := -g'{0)/g"{0), therefore x,(7) « ^^^o/tdt = eo(log7 " log7o). We 
use 7^/i'(7) = X* ,^o(log7 — log 70) and 7^/1" (7) w ^0 — 2x=k in this case. 

A. 6 Details for Specific Potentials 

Our algorithms are configured by the dual functions /ii(7i) for each non-Gaussian ti{si), and the 



inner loops require h*{si) and its derivatives (see (12), and recall that for each i, either zi^i > 
and z^^i = 0, or zi^i = and Z2,i > 0, 23^4 = 1). In this section, we show how these are computed for 
the potentials used in this paper. We use the notation of Appendix ] A. 5 focus on a single potential 
i and drop its index. 
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Laplace Potentials 



These are t{s) = exp(— r|s|), r > 0, so that g{x) = tx^I"^ . We have that /i(7) = huij) = t^7, so 
that k{x,^) = {zi + x)/7 + (z2 + t"^)! — z^log^. The stationary equation for 7* is (22 + t^)7^ — 



•Z37 — {zi + x) = 0. If 23 = (bounding type A), this is just a special case of Appendix A. 5 With 
p := zi + X, q := {z2 + t'^)^/'^, we have that 7* = p^ {'^/q, h*{s) = qp^^"^, and 6 = {h*y{s) = qp~^/'^s, 



p = {h*)"{s) = qzip With a view to Appendix A. 7, 9 = qp -^/^ and k = [qp 3/2ji/2_ 

If Z3 = 1 (bounding type B), note that zi = 0, Z2 > 0. Let q := 2{z2 + r^), p := {1 + Iqx)^/"^. 
Then, 7* = (p + !)/<?, and /c(x,7*) = p — log(p + 1) + logg after some algebra, so that h*{s) = 
\{p — log(p + 1) + log q). With dp/ (ds) = 2qp~^s, we have 9 = qs/ {p + 1), p = q/ {p{p + 1)). With a 



view to Appendix A. 7, 9 = q/ (p+1). Using p^ — l = 2xq, some algebra gives k = (2/p)^/^g/ (p+1) 
{2/py/^. 



Student's t Potentials 



These are t{s) = (1 + (r/z^)x) ('^+1)/^^ > 0, r > 0. If a := z^/r, the critical point of Appendix A. 2 
is 70 := a/(z^ + 1), and /i(7) = [a/7 + (i^ + l)log7 + C]I{^>^g} with C := -{u + l)(log7o + 1). 
h{'y) is not convex. We choose a decomposition such that /iu(7) is convex and twice continuously 
differentiable, ensuring that h*(s) is continuously differentiable, and the inner loop optimization 
runs smoothly. Since /i(7) does not have a second derivative at 70, neither has /in (7). 

^ / ('^ + 1 - ^3) log7 I 7 > 70 

\ (2(z. + l)-Z3)log7-a(7-7o)-6 I 7 < 7o ' 

u ^ i + C I 7 > 70 

uvr; I _2(j, + i)iog^ + a(7-7o) + 6 | 7 < 7o ' 

where b := {u + 1) log 70, a := {u + l)/7o- Here, the — 2;3log7 term of k{x,^) is folded into /in(7)- 



We follow Appendix A. 5 in determining h*{s) and its derivatives, but solve for 7* directly. Note that 
Z2 > even if 23 = (bounding type A), due to the Fenchel bound on /in(7)- We minimize k{x, 7) for 
7 > 70) 7 < 7o respectively and pick the minimum. For 7 > 79: k{x, 7) = (zi + a + x)/7 + Z27 + C, 
whose minimum point 7=k^i := [{zi + a + x)/z2]^^'^ is a candidate if 7*,i > 70, with A;(a;,7*^i) = 
2[z2{zi + a + x)Y/^ + C . For 7 < 70: fc(x,7) = (zi + x)/7 + (2:2 + 0)7 - 2(z^ + 1) log7 + 6- 070, with 
minimum point 7*^2 := + + + {z2+a){zi+x))^^'^]/{z2 + a) < 70. If Z2 < a, then 7*^2 > 7o 
(not a candidate). This can be tested up front. If c := {z2 + a){zi + x), d := ((z^ + 1)^ + c)^/^ > + 1, 
then k{x, 7=,,2) = c/(i/ + 1 + d) + d + (z^ + 1) [2 log(z2 + a) - 2 log(i/ + 1 + d) + log 70] = 2d + (1/ + 



l)[21og(2;2 + a) — 21og(z^ + 1 + d) + log 70 — 1]. Now, 9 and p are computed as in Appendix A. 5 
(^1(7) there is h\j{^) here, and 23 = 0, since this is folded into /in here), where 7^/1" (7*) = 2a for 
7* > 70, and 7^/1(^(7*) = 2{v + 1)7=^ for 7=^ < 70. 

Bernoulli Potentials 

These are t{s) = (1 + e-y^'')-^ = e2^^^/2(2 coshu)-^ v := (yr/2)x^/2 = (yr/2)|s|. They are not 
even, h = yr 12. While /i(7) is not known analytically, we can plug in these expressions into the 



generic setup of Appendix A. 5, Namely, g{x) = — log(cosht;) — log2, so that g'{x) = — C(tanhf)/f, 
g"{x) = {C /2)x~^ {{tanhv) /v + tanli^ V — 1), C := (yr/2)^/2. For x close to zero, we use tanhiJ = 
V — + 2?;^/15 + 0{v'^) for these computations. Moreover, 70 = 1/(2C) and ^0 = 3/(2C). 



30 



A. 7 Group Potentials 



An extension of our framework to group potentials is described in Section 4.5 Recall 

the details about the IRLS algorithm from Section [4.3[ For group potentials, the inner Hessian is 
not diagonal anymore, but of simil arly simple form developed here. h*{si) becomes and 



Xi = If 6i, Pi are as in Section 4.3, dsi — )• d||sj||, and 9i := 9i/\\si\\, we have that V^./i* = 9iS- 



since VsJ|sj|| = Therefore, the gradient 6 is given by Oi = OiSi. Moreover, 

= e^I -{Oi- pi)\\s.i\l^SisJ. 



For simplicity of notation, assume that all Sj have the same dimensionality. From Appendix A. 5 
we see that 9i > pi. Let Kj := {9i — pi)^^'^ /\\si\\, and s := ((diag^) (S> I)s. The Hessian w.r.t. s is 

H^"'' = (diag0) / -^Wiwf, Wi = {6iSf I)s. 

i 

If w is given by Wi = sjvi, then H^^'^v = ((diag0) I)v — ((diagw) I)s. The system matrix 
for the Newton direction is cr" -^X^X + B^H'^^^B. For numerical reasons, 9i and should be 
computed directly, rather than via 9i, pi. 

If Si G M^, we can avoid the subtraction in computing H^^^v and gain numerical stability. Namely, 
VVs.h* =piI + Kf {\\sifl-Sisf). Since \\sifl - Sisf = Msi{Msif , M = d2Sf - diS^ , if we 
redefine s := ((diagK) (g) (^2^^ — then 



H^^'^v = ((diagp) (g) I)v + ((diagw) ® I)s, w 



-T 
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